
Avoid writing tabular files within pipelines. #29

Open · wants to merge 1 commit into metadata_arguments

Conversation

lkitching
Contributor

Add implementations of the csv2rdf RowSource protocol which allow
transformed versions of pipeline input files to be passed directly.

The RowSource protocol represents tabular resources as a logical
sequence of records, each containing the source row number and parsed
data cells. Pipelines previously wrote transformed versions of the
input files to disk so they could be passed to the CSVW process.
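
For illustration, a minimal sketch of what such a protocol might look like in Clojure (the names and record shape here are assumptions, not the actual csv2rdf definitions):

```clojure
;; Hypothetical sketch of a row-source protocol as described above.
;; The real protocol lives in csv2rdf and may differ in names and shape.
(defprotocol RowSource
  (row-records [this]
    "Returns a logical sequence of row records, each containing the
    source row number and the parsed data cells."))

;; A row record might look like:
;; {:source-row-number 2
;;  :cells ["2017" "K02000001" "67.3"]}
```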

Add implementations of the RowSource protocol which allow the
transformation process to be done in memory and present the
transformed row records directly to the CSVW process.
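
A hedged sketch of what an in-memory implementation could look like, assuming the hypothetical protocol above:

```clojure
;; Hypothetical in-memory implementation: wraps an already-transformed
;; sequence of cell vectors so no intermediate file needs to be written.
(defrecord MemoryRowSource [rows]
  RowSource
  (row-records [_this]
    (map-indexed (fn [idx cells]
                   {:source-row-number (inc idx)
                    :cells cells})
                 rows)))

;; Usage: feed transformed rows straight to the CSVW process instead of
;; round-tripping them through a temporary CSV file.
(def source (->MemoryRowSource [["2017" "K02000001" "67.3"]
                                ["2018" "K02000001" "68.1"]]))
(row-records source)
;; => ({:source-row-number 1, :cells ["2017" "K02000001" "67.3"]}
;;     {:source-row-number 2, :cells ["2018" "K02000001" "68.1"]})
```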

The number of component specifications derived within cube-pipeline
is expected to be quite small. Load these into memory and add
a RowSource implementation which returns the corresponding tabular
rows to csv2rdf.
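
For example (again a hypothetical sketch reusing the MemoryRowSource above; the column set is invented for illustration):

```clojure
;; Component specifications are few, so hold them in memory and expose
;; them as tabular rows via the same row-source abstraction.
(defn component-specs->row-source
  [specs]
  (->MemoryRowSource
    (mapv (fn [{:keys [label component-type codelist]}]
            [label component-type codelist])
          specs)))
```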

Update the tests which check the format of the intermediate
transformed data to use the transformed row sources.

@Robsteranium
Contributor

I guess this complements #27? Whereas that deals with the metadata, this deals with the tables themselves?

Sorry this hasn't been reviewed sooner @lkitching.

Is it still valid? I guess we'll need to update it to resolve merge conflicts. I wonder if there's any interaction now with #120?

Ideally we'll reach the point where we can run as either a) csv->csvw or b) csv->rdf, to support interop/scrutability and overall efficiency respectively.

@lkitching
Contributor Author

@Robsteranium - I'm not sure we want to use this any more. We no longer use #27 within #120 either, since we always write the CSVW to disk. We could resurrect this approach in future since the infrastructure still exists within csv2rdf, but it's probably more effort than it's worth for now.

@Robsteranium
Contributor

Robsteranium commented Apr 21, 2020

There was a reason for doing this though, wasn't there... Was it an OOME, or was it just faster without the I/O? I can't remember!

Let's leave the PR and branch open in case we want to reintroduce it later.
